Summary

The document reads .ris bibliographic files, filters selected studies, and categorises data sources into Articles, Packages, and Kaggle. All entries are classified by use-case type, data type, sport, population, and synthetic-generation potential.

A final evaluation scores all datasets against predefined criteria and compares their suitability for generating synthetic datasets with Statistical and/or GAN-based approaches.

  • Article: Peer-reviewed data used in a published paper.
  • Package: Dataset available through a CRAN or Python package.
  • Kaggle: Dataset accessible from the Kaggle.com platform.

Aim

To compile, classify, and evaluate publicly available sports datasets based on data source, methodological characteristics, and use-case categories.

Approach

  1. Import bibliographic datasets and conduct manual and Shiny-based screening.

  2. Merge Article, Package, and Kaggle datasets into a single structured dataset.

  3. Classify each dataset into five use-case types:

    • Movement
    • Tactical
    • Performance
    • Injury
    • Player-focused
  4. Extract detailed metadata: population, sport domain, geographic region, study design, sample size, and data type.

  5. Apply an eight-criterion scoring system to evaluate dataset quality.

  6. Rank datasets separately for Statistical and GAN-based applications.

Results

  1. A total of 50 datasets were included after screening. These were distributed across source types:

     Source Type | Number of Datasets
     Articles    | 32
     Packages    | 12
     Kaggle      | 6

  2. Datasets are divided into GAN-based (video/image oriented) and Statistical (tabular, sensor, physiological, or survey-based):

     Category    | Number of Datasets | Data Types | Population | Most Frequent Sports | Top 3 (by Score)
     GAN-based   | 16 | Video, Image | Athlete | Multiple, Basketball, Fitness | TeamTrack, C-Sports, SportsMOT
     Statistical | 34 | Tabular, Physiological, Medical Record, Survey, Accelerometer | Athlete, Multiple | Football, Baseball, Basketball, Fitness | MTS-5, NCAA-ISP, LLBD

  3. Datasets were assigned to five use-case categories:

     Use Case    | Description | Examples
     Movement    | Pose estimation, motion tracking, biomechanical analysis | LLBD, HTHARD, RBD, WEAR
     Tactical    | Formation recognition, event detection, game-situation analysis | TeamTrack, C-Sports, SportsMOT, MultiSports
     Performance | Fatigue prediction, load and performance monitoring | MTS-5, ScopeSense, PMData, Lahman
     Injury      | Impact simulation, risk modelling, unsafe event reconstruction | NCAA-ISP, NEISS, FFTSC-10Y, NHL-ATR
     Player      | Player identity, jersey recognition, technical skill prediction | SportsHHI, NFBDB2026, nflfastR/hoopR/nhlapi
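
As a quick consistency check on the counts reported above, the two breakdowns can be summed and compared with the stated total of 50. This is only an arithmetic sketch: the numbers are copied from the Results tables, and the variable names are illustrative.

```r
# Counts copied from the Results tables above
by_source   <- c(Articles = 32, Packages = 12, Kaggle = 6)
by_category <- c("GAN-based" = 16, Statistical = 34)

total_source   <- sum(by_source)
total_category <- sum(by_category)

# Both breakdowns should account for all 50 included datasets
stopifnot(total_source == 50, total_category == 50)

# Share of each source type in percent: Articles 64, Packages 24, Kaggle 12
source_share <- round(100 * by_source / total_source)
source_share
```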

Data Preparation

# Inspect the files
list.files("data/database/ris/")
## [1] "ebscoSport.ris"     "ieee.ris"           "qut.ris"           
## [4] "scienceDirect.ris"  "springerNature.ris" "wos.ris"
# The Web of Science export (wos.ris) contains empty lines that cause an error; remove them
wos <- readLines("data/database/ris/wos.ris")
wos <- wos[wos != ""] 
writeLines(wos, "data/database/ris/wos.ris")
# Read all files as a list and convert to a data frame
files <- list.files("data/database/ris", pattern = "\\.ris$", full.names = TRUE)

bibliography <- read_bibliography(filename = files, return_df = TRUE)
bibliography
# Title preparation: lower-case, then strip punctuation and digits
bibliography$titleLower <- tolower(bibliography$title)
bibliography$titleLower <- strip(bibliography$titleLower, apostrophe.remove = TRUE)
head(bibliography$titleLower)
## [1] "secondary prevention of musculoskeletal sports injuries a scoping review of early detection and early intervention strategies"                                                
## [2] "the effects of rule changes in footballcode team sports a systematic review"                                                                                                  
## [3] "how physical education teachers are positioned in models scholarship a scoping review"                                                                                        
## [4] "physical education from lgbtq students perspective a systematic review of qualitative studies"                                                                                
## [5] "the altmetric score has a stronger relationship with article citations than journal impact factor and open access status a crosssectional analysis of sport sciences articles"
## [6] "methods of the national collegiate athletic association injury surveillance program – through –"
# Check for duplicates
unique(bibliography$titleLower[duplicated(bibliography$titleLower)])
##  [1] "crosssectional and longitudinal associations of active travel organised sport and physical education with accelerometerassessed moderatetovigorous physical activity in young people the international children’s accelerometry database"
##  [2] "match score dataset for team ball sports"                                                                                                                                                                                                
##  [3] "collective sports a multitask dataset for collective activity recognition"                                                                                                                                                               
##  [4] "tgc reid a dataset for sport event reidentification in the wild"                                                                                                                                                                         
##  [5] "regular sports services dataset of demographic frequency and service level agreement"                                                                                                                                                    
##  [6] "aspset an outdoor sports pose video dataset with d keypoint annotations"                                                                                                                                                                 
##  [7] "dataset for the analysis of tv viewer response to live sport broadcasts and sponsor messages"                                                                                                                                            
##  [8] "sports work strategy of college counselors based on mysql database big data analysis"                                                                                                                                                    
##  [9] "epidemiology of testicular trauma in sports analysis of the national electronic injury surveillance system database"                                                                                                                     
## [10] "administrative databases used for sports medicine research demonstrate significant differences in underlying patient demographics and resulting surgical trends"                                                                         
## [11] "analysis of research trends on elbow pain in overhead sports a bibliometric study based on web of science database and vosviewer"                                                                                                        
## [12] "the racial and sexual differences in emergency department visits for sportrelated spine fracture injuries a neiss database study"                                                                                                        
## [13] "comprehensive dataset on presarscov infection sportsrelated physical activity levels disease severity and treatment outcomes insights and implications for covid management"                                                             
## [14] "analysis of a comprehensive dataset influence of vaccination profile types and severe acute respiratory syndrome coronavirus reinfections on changes in sportsrelated physical activity one month after infection"
# Remove duplicated titles, keeping the first unique entry
bibliography <- bibliography[!duplicated(bibliography$titleLower), ]

# Check that duplicates are gone
any(duplicated(bibliography$titleLower))
## [1] FALSE
dim(bibliography)
## [1] 278 104
# Use shiny app to filter based on Abstract
# screen_abstracts(bibliography)
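
The normalise-then-deduplicate logic above can be sketched on toy data using base R only. The titles are made up, and `normalise_title` is a hypothetical helper that only roughly approximates what `strip()` does in the code above (lower-casing and removing punctuation and digits); it is not the function used in the analysis.

```r
# Toy titles (made up); entries 1/4 and 2/3 differ only in case or punctuation
titles <- c(
  "Match Score Dataset for Team Ball Sports",
  "A Sports Pose Video Dataset",
  "A sports pose video dataset.",
  "Match score dataset for team ball sports"
)

# Hypothetical helper: lower-case, keep only letters and spaces,
# then collapse repeated whitespace
normalise_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)
  trimws(gsub(" +", " ", x))
}

key <- normalise_title(titles)

# Normalised titles that occur more than once
dup_titles <- unique(key[duplicated(key)])

# Keep the first occurrence of each normalised title
kept <- titles[!duplicated(key)]
kept
```

Matching on a normalised key rather than the raw title means that records exported from different databases with small formatting differences are still treated as the same study.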

Bibliography

The dataset is filtered to keep only the selected articles, which reduces the number of records from 278 to 89.

bibliographyRev <- read.csv("data/database/bibliography/bibliographyRev.csv")
bibliographyRev <- bibliographyRev %>%
                   filter(screened_abstracts == "selected") %>%
                   dplyr::select(author, title, year, keywords, abstract, doi, titlelower,
                                 filename)

# write.csv(bibliographyRev, "data/database/bibliography/bibliographyRevSelected.csv",
#           row.names = FALSE)

dim(bibliographyRev)
## [1] 89  8
colnames(bibliographyRev)
## [1] "author"     "title"      "year"       "keywords"   "abstract"  
## [6] "doi"        "titlelower" "filename"
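
A minimal base R sketch of the same selection step, on made-up records. The `"excluded"` label and the `NA` row are assumptions added for illustration; only the `"selected"` label and the `screened_abstracts` column name come from the code above.

```r
# Toy screening table (made up); "excluded" is an assumed label
screened <- data.frame(
  title = c("Paper A", "Paper B", "Paper C", "Paper D"),
  year = c(2019, 2021, 2022, 2023),
  screened_abstracts = c("selected", "excluded", "selected", NA),
  stringsAsFactors = FALSE
)

# Tally every screening decision, including records never screened (NA)
decision_counts <- table(screened$screened_abstracts, useNA = "ifany")

# Keep selected rows only; %in% treats NA as FALSE, so unscreened rows are dropped
selected <- screened[screened$screened_abstracts %in% "selected", c("title", "year")]

nrow(selected)
```

Tabulating the decisions before filtering makes it easy to confirm that the number of retained records matches the expected count.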

Database

From the output file above, an Excel file was created manually to categorise the databases into Articles (sheet = databaseAR), Packages (R and Python) (sheet = databasePA), and Kaggle (sheet = databaseOT).

  1. Articles: Databases were searched using the keywords “sport” AND “database” or “sport” AND “dataset” for publicly available datasets.

  2. Packages: Actively maintained packages that provide athlete-related datasets were included.

  3. Kaggle: In the datasets category, the keywords used were “injuries”, “sport”, “NFL”, and “AFL”. In the competitions category, only “sport” was used. For both categories, only the top 10 datasets were reviewed.

# List all the sheets 
excel_sheets("data/database/bibliography/bibliographyRevSelected.xlsx")
## [1] "bibliography" "databaseAR"   "databasePA"   "databaseOT"   "database"    
## [6] "rank"         "summary"

The database sheet contains the merged data from all files, and the summary sheet will be used to generate insights and visualisations.

# Read the summary sheet
summary <- read_excel("data/database/bibliography/bibliographyRevSelected.xlsx", 
                           sheet = "summary")

colnames(summary)
##  [1] "column"               "study.title"          "dataset.name"        
##  [4] "dataset"              "dataset.type"         "methods.model"       
##  [7] "use.case.type"        "use.case"             "aim.dataset"         
## [10] "valid.data"           "total.score"          "synthetic.generation"
## [13] "country"              "year.start"           "year.end"            
## [16] "year.range"           "population.age.range" "population.type"     
## [19] "population.sex"       "population"           "sample.overall"      
## [22] "sample.raw"           "sample.size"          "study.design"        
## [25] "sport.type"           "sports.covered"       "data.type"           
## [28] "variables.collected"  "literature.category"

General Analysis of the datasets

colorPalette <- RColorBrewer::brewer.pal(8, "Set2")

f1 <- plot_ly(summary,
              x = ~population.type, y = ~sample.overall,
              type = 'scatter', mode = 'markers',
              color = ~population.type, colors = colorPalette,
              size = ~sample.overall, sizes = c(10, 60),
              marker = list(opacity = 0.7, line = list(width = 1, color = '#333')),
              hoverinfo = 'text',
              text = ~paste('Dataset:', dataset.name,
                            '<br>Samples:', sample.overall,
                            '<br>Population:', population.type),
              showlegend = FALSE)

f2 <- plot_ly(summary %>% count(sport.type),
                x = ~sport.type, y = ~n, type = 'bar',
                color = ~sport.type, colors = colorPalette,
                showlegend = FALSE) 

f3 <- plot_ly(summary %>% count(data.type),
                x = ~n, y = ~reorder(data.type, n),
                type = 'bar', orientation = 'h',
                color = ~data.type, colors = colorPalette,
                showlegend = FALSE) 

f4 <- plot_ly(summary,
              x = ~data.type, y = ~sample.overall,
              type = 'scatter', mode = 'markers',
              color = ~valid.data, colors = c('#E15759', '#59A14F'),
              size = ~sample.overall, sizes = c(10, 50),
              marker = list(opacity = 0.7),
              hoverinfo = 'text',
              text = ~paste('Dataset:', dataset.name,
                            '<br>Type:', data.type,
                            '<br>Valid:', valid.data,
                            '<br>Samples:', sample.overall)) 

fig <- subplot(f1, f2, f3, f4, nrows = 2, margin = 0.20) %>%
  layout(
    plot_bgcolor = "rgba(0,0,0,0)",
    paper_bgcolor = "rgba(0,0,0,0)",
    showlegend = TRUE,
    legend = list(orientation = "h", x = 0.55, y = -0.15),
    annotations = list(
      list(text = "Sample Size by Population Type", 
           x = 0.20, y = 1.05, showarrow = FALSE, 
           xref='paper', yref='paper', font=list(size=14)),
      list(text = "Datasets by Sport Type", 
           x = 0.80, y = 1.05, showarrow = FALSE, 
           xref='paper', yref='paper', font=list(size=14)),
      list(text = "Data Type Distribution", x = 0.20, y = 0.47, 
           showarrow = FALSE, xref='paper', yref='paper', font=list(size=14)),
      list(text = "Sample Size vs Data Type (by Validity)", 
           x = 0.80, y = 0.47, showarrow = FALSE, xref='paper',
           yref='paper', font=list(size=14))
    )
  )

fig

Analysis of datasets by country and type

# Split multi-country entries so that each dataset has one row per country.
# Datasets with multiple countries will have multiple rows
summaryMap <- summary %>%
  mutate(country = str_split(country, ",")) %>%
  unnest(country) %>%
  mutate(country = str_trim(country))

summaryMap
# Generate the information to display in the map
countrySummary <- summaryMap %>%
  group_by(country) %>%
  summarise(
    nDatasets = n(),
    datasets = paste(unique(column), collapse = "; "),
    studyDesigns = paste(unique(study.design), collapse = "; "),
    sampleRange = paste0("Min: ", min(sample.overall, na.rm = TRUE),
                         " | Max: ", max(sample.overall, na.rm = TRUE)),
    population = paste(unique(population.type), collapse = "; "),
    sex = paste(unique(population.sex), collapse = "; "),
    sports = paste(unique(sport.type), collapse = "; "),
    reference = paste(unique(dataset), collapse = "; ")
  )

countrySummary
# Create hover text with the information above
countrySummary <- countrySummary %>%
  mutate(hoverText = paste0(
    "<b>", country, "</b><br>",
    "Datasets: ", nDatasets, "<br>",
    "Study Design: ", studyDesigns, "<br>",
    "Sample Range: ", sampleRange, "<br>",
    "Population: ", population, "<br>",
    "Sex: ", sex, "<br>",
    "Sports: ", sports, "<br>",
    "Dataset Names: ", datasets, "<br>",
    "Reference: ", reference
  ))

countrySummary
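
The row-splitting and per-country aggregation above can be illustrated on a small made-up table using base R only. The column names follow the summary sheet, but the values are invented.

```r
# Toy data (made up): the first dataset spans two countries
toy <- data.frame(
  column = c("DS1", "DS2", "DS3"),
  country = c("Australia, USA", "USA", "Norway"),
  sample.overall = c(120, 5000, 16),
  stringsAsFactors = FALSE
)

# Split the comma-separated field and trim surrounding whitespace
parts <- lapply(strsplit(toy$country, ","), trimws)

# Repeat each dataset row once per country it lists
long <- toy[rep(seq_len(nrow(toy)), lengths(parts)), ]
long$country <- unlist(parts)
rownames(long) <- NULL

# Per-country dataset count and sample-size range
n_datasets <- tapply(long$column, long$country, function(x) length(unique(x)))
sample_min <- tapply(long$sample.overall, long$country, min)
sample_max <- tapply(long$sample.overall, long$country, max)
```

Because a multi-country dataset contributes one row to each of its countries, the per-country counts can add up to more than the number of datasets.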

The following map does not display the International (n = 9) and Commonwealth countries (n = 1) datasets.

Additionally, the plot allows us to visualise the different types of datasets:

  1. Article: Data validated and used in a paper.
  2. Package: Dataset can be extracted from a CRAN or Python package.
  3. Kaggle: Available from the website Kaggle.com.
# Combine plots (mapP and barP are assumed to be created in a chunk not shown here)
fp <- subplot(
  mapP, barP,
  nrows = 2, shareX = FALSE,
  heights = c(0.50, 0.50),
  margin = 0.8
) %>%
  layout(
    title = "Global Distribution of Public Sports Datasets"
  )

fp 

Analysis of Variables by Dataset

# Prepare the dataset selecting the relevant columns
variables <- summary %>%
  select(column, sport.type, data.type, variables.collected, dataset) %>%
  mutate(
    sourceType = case_when(
      str_detect(dataset, regex("Kaggle", ignore_case = TRUE)) ~ "Kaggle",
      str_detect(dataset, regex("CRAN|Python", ignore_case = TRUE)) ~ "Package",
      TRUE ~ "Article"
    ),
    variables.collected = str_replace_all(
      variables.collected,
      regex("(\\d+\\.)\\s*", ignore_case = TRUE),
      "<br>• "
    ),
    variables.collected = paste0("<b>Variables:</b>", variables.collected)
  ) 

variables
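
A base R sketch of the two text rules used above, on made-up strings: classifying a reference by source type, and turning an inline numbered list into HTML bullet lines for hover text. The example references and variable names are invented.

```r
# Toy references (made up)
dataset <- c("Kaggle: nfl-injuries", "CRAN package foo",
             "Smith et al. (2020)", "python package bar")

# Order matters: the first matching rule wins; anything unmatched is an Article
source_type <- ifelse(
  grepl("kaggle", dataset, ignore.case = TRUE), "Kaggle",
  ifelse(grepl("cran|python", dataset, ignore.case = TRUE), "Package", "Article")
)

# Replace each "<number>." marker (and following spaces) with an HTML bullet
vars <- "1. Heart rate 2. Speed 3. Distance"
vars_html <- gsub("[0-9]+\\.[[:space:]]*", "<br>• ", vars)
vars_html
```

Note that this pattern matches any digits followed by a period, so a decimal value such as `2.5` inside the text would also be split; a stricter pattern may be needed if the variable descriptions contain decimals.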

The following plot links three sections:

  1. Datasets on the left
  2. Sports in the middle
  3. Variables on the right

Each flow represents a connection between these sections and is coloured by its data source type (Kaggle: orange, Package: green, Article: blue). Hover over a flow to see the variables included in that connection.

# Create Node List 
nodes <- data.frame(
  name = unique(c(variables$column, variables$sport.type, variables$data.type))
)

# Function to map each label to numeric index
get_index <- function(x) match(x, nodes$name) - 1

# Links
links <- bind_rows(
  variables %>%
    transmute(
      source = get_index(column),
      target = get_index(sport.type),
      type = sourceType,
      hover = variables.collected
    ),
  variables %>%
    transmute(
      source = get_index(sport.type),
      target = get_index(data.type),
      type = sourceType,
      hover = variables.collected
    )
)

color_map <- c(
  "Kaggle" = "#FFB347",
  "Package" = "#77DD77",
  "Article" = "#779ECB"
)
links$color <- color_map[links$type]

# Plotly Sankey 
fig <- plot_ly(
  type = "sankey",
  arrangement = "snap",
  node = list(
    label = nodes$name,
    color = "grey",
    pad = 15,
    thickness = 20,
    line = list(color = "black", width = 0.5)
  ),
  link = list(
    source = links$source,
    target = links$target,
    value = rep(1, nrow(links)),          
    color = links$color,
    customdata = links$hover,           
    hovertemplate = "%{customdata}<extra></extra>"
  )) 

fig <- fig %>%
  layout(
    title = list(
      text = "Variables Across Sports Datasets",
      font = list(size = 18, color = "#333", family = "Roboto")
    ),
    font = list(size = 12),
    margin = list(l = 10, r = 10, t = 60, b = 10),
    annotations = list(
      list(
        x = 0.00, y = 1.05,
        text = "<b>Datasets</b>",
        showarrow = FALSE,
        xref = "paper", yref = "paper",
        font = list(size = 14, color = "#FFB347", family = "Roboto")
      ),
      list(
        x = 0.50, y = 0.76,
        text = "<b>Sports</b>",
        showarrow = FALSE,
        xref = "paper", yref = "paper",
        font = list(size = 14, color = "#77DD77", family = "Roboto")
      ),
      list(
        x = 0.95, y = 0.75,
        text = "<b>Variables</b>",
        showarrow = FALSE,
        xref = "paper", yref = "paper",
        font = list(size = 14, color = "#779ECB", family = "Roboto")
      )
    )
  )

fig
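
The node and link construction used for the Sankey diagram can be checked on a small made-up table. The sketch below uses base R only and mirrors the zero-based indexing above; the data values are invented.

```r
# Toy table (made up): one row per dataset with its sport and data type
toy <- data.frame(
  column = c("DS1", "DS2", "DS3"),
  sport.type = c("Football", "Football", "Basketball"),
  data.type = c("Video", "Tabular", "Video"),
  stringsAsFactors = FALSE
)

# One node per distinct label across all three columns
node_names <- unique(c(toy$column, toy$sport.type, toy$data.type))

# Plotly expects zero-based node indices; R vectors are one-based
get_index <- function(x) match(x, node_names) - 1

links <- rbind(
  data.frame(source = get_index(toy$column), target = get_index(toy$sport.type)),
  data.frame(source = get_index(toy$sport.type), target = get_index(toy$data.type))
)

# Sanity checks: every link resolves to a node, and no label is shared between stages
stopifnot(!anyNA(unlist(links)), all(unlist(links) < length(node_names)))
stopifnot(length(intersect(toy$sport.type, toy$data.type)) == 0)
```

The last check matters because nodes are keyed by label: if the same label appeared both as a sport and as a data type, the two would collapse into a single node.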

Ranking of Datasets

All datasets were analysed and scored manually based on the following table:

The rank sheet was added to the main file to store the scores. Two new columns were created manually: TotalScore, the score assigned to each dataset, and literatureCategory, the category assigned through the literature review (GAN-based or Statistical).

# Select the variables of interest
summaryScore <- summary %>%
  select(column, total.score, literature.category, valid.data, 
         population.type, sport.type, data.type)

summaryScore
# Rename the columns 
summaryScore <- summaryScore %>%
  rename(dataset = column, 
         group = literature.category, 
         value = total.score) %>% 
  mutate(group = as.factor(group)) %>%
  arrange(group, desc(value))
# Split into two data frames, one per plot; each plot shows dataset details on hover
statsD <- filter(summaryScore, group == "Statistical")
ganD <- filter(summaryScore, group == "GAN-based")

colorPalette <- setNames(
  colorRampPalette(brewer.pal(min(max(length(unique(summaryScore$data.type)), 3), 8), 
                              "Set2"))(length(unique(summaryScore$data.type))),
  unique(summaryScore$data.type))

stat <- plot_ly(statsD,
                x = ~value,
                y = ~reorder(dataset, value),
                type = 'bar',
                orientation = 'h',
                color = ~data.type, 
                colors = colorPalette,
                hoverinfo = 'text',
                marker = list(line = list(width = 1.5)),
                text = ~paste(
                  "<b>Dataset:</b>", dataset,
                  "<br><b>Value:</b>", round(value, 3),
                  "<br><b>valid.data:</b>", valid.data,
                  "<br><b>Population:</b>", population.type,
                  "<br><b>Sport:</b>", sport.type,
                  "<br><b>Data Type:</b>", data.type
                ))

gan <- plot_ly(ganD,
               x = ~value,
               y = ~reorder(dataset, value),
               type = 'bar',
               orientation = 'h',
               color = ~data.type,
               colors = colorPalette,
               hoverinfo = 'text',
               marker = list(line = list(width = 1.5)),
               text = ~paste(
                 "<b>Dataset:</b>", dataset,
                 "<br><b>Value:</b>", round(value, 3),
                 "<br><b>valid.data:</b>", valid.data,
                 "<br><b>Population:</b>", population.type,
                 "<br><b>Sport:</b>", sport.type,
                 "<br><b>Data Type:</b>", data.type
               )) 

legend <- data.frame(
  data.type = unique(summaryScore$data.type),
  color = unname(colorPalette[unique(summaryScore$data.type)])
)

legendM <- plot_ly()
for(i in seq_len(nrow(legend))) {
  legendM <- legendM %>%
    add_trace(
      type = "scatter",
      mode = "markers+text",
      x = 1, y = i,
      marker = list(size = 14, color = legend$color[i]),
      text = legend$data.type[i],
      textposition = "right",
      hoverinfo = "none",
      showlegend = FALSE
    ) %>%
    layout( title = "Data Type",
    xaxis = list(
      visible = FALSE,
      zeroline = FALSE,
      showgrid = FALSE,
      showticklabels = FALSE
    ),
    yaxis = list(
      visible = FALSE,
      zeroline = FALSE,
      showgrid = FALSE,
      showticklabels = FALSE
    )
  )
}

p <- subplot(
  subplot(stat, gan, nrows = 2, shareX = TRUE, titleY = TRUE), legendM, 
  widths = c(0.70, 0.30)) %>%
  layout(title = "Ranking of datasets by Approach (Statistical vs GAN-based Approaches)",
         showlegend = FALSE, 
         # stat is the first (top) panel, so it owns yaxis; gan owns yaxis2
         yaxis = list(title = "Statistical", automargin = TRUE),
         yaxis2 = list(title = "GAN-based", automargin = TRUE))

p
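
The per-group ranking behind the plot can be sketched in base R. The dataset names and scores below are made up and only illustrate the sort-and-rank logic; the real scores are stored in `total.score`.

```r
# Toy scores (made up)
scores <- data.frame(
  dataset = c("A", "B", "C", "D", "E", "F"),
  group = c("Statistical", "Statistical", "Statistical", "Statistical",
            "GAN-based", "GAN-based"),
  value = c(0.71, 0.88, 0.64, 0.80, 0.92, 0.55),
  stringsAsFactors = FALSE
)

# Sort by group, then by descending score within each group
ranked <- scores[order(scores$group, -scores$value), ]

# Rank within each group (1 = best); ave() applies rank() per group
ranked$rank <- ave(-ranked$value, ranked$group, FUN = rank)

# Keep at most the top three per group
top3 <- ranked[ranked$rank <= 3, ]
top3
```

With the default `rank()` behaviour, tied scores receive an averaged rank, so a tie at the cut-off could exclude both tied datasets; passing `ties.method = "min"` would keep them.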

Use Cases

# Wrap text to multiple lines for readability
wrap_text <- function(x, width = 25) str_wrap(x, width = width)

summary <- summary %>%
  mutate(
    methods.model = wrap_text(methods.model, 100),
    use.case = wrap_text(use.case, 100),
    aim.dataset = wrap_text(aim.dataset, 200),
    variables.collected = wrap_text(variables.collected, 200),
    country = wrap_text(country, 100),
    year.range = wrap_text(year.range, 100),
    population = wrap_text(population, 100),
    sample.size = wrap_text(sample.size, 100),
    sports.covered = wrap_text(sports.covered, 100),
    data.type = wrap_text(data.type, 100),
    study.design = wrap_text(study.design, 100),
    literature.category = wrap_text(literature.category, 100)
  )

summary
# Level 1: dataset.type
lvl1 <- summary %>%
  distinct(dataset.type) %>%
  mutate(
    ids    = dataset.type,
    labels = dataset.type,
    parents = ""
  )

# Level 2: use.case.type
lvl2 <- summary %>%
  distinct(dataset.type, use.case.type) %>%
  mutate(
    ids    = paste(dataset.type, use.case.type, sep = "-"),
    labels = use.case.type,
    parents = dataset.type
  )

# Level 3: sport.type
lvl3 <- summary %>%
  distinct(dataset.type, use.case.type, sport.type) %>%
  mutate(
    ids    = paste(dataset.type, use.case.type, sport.type, sep = "-"),
    labels = sport.type,
    parents = paste(dataset.type, use.case.type, sep = "-")
  )

# Level 4: dataset

lvl4 <- summary %>%
  distinct(
    dataset.type, use.case.type, sport.type, dataset,
    dataset.name, methods.model, use.case,
    aim.dataset, variables.collected, country, year.range,
    population, sample.size, sports.covered, data.type,
    study.design, literature.category
  ) %>%
  mutate(
    aim.dataset = str_replace_all(
      aim.dataset,
      regex("\\b(\\d+\\.)", ignore_case = TRUE),
      "<br>\\1 "
    ),
    variables.collected = str_replace_all(
      variables.collected,
      regex("\\b(\\d+\\.)", ignore_case = TRUE),
      "<br>\\1 "
    ),
    
    ids = paste(dataset.type, use.case.type, sport.type, dataset, sep = "-"),

    labels = paste0(
      "<b>Dataset:</b> ", dataset, "<br>",
      "<b>Name:</b> ", dataset.name, "<br>",
      "<b>Method:</b> ", methods.model, "<br>",
      "<b>Use Case:</b> ", use.case, "<br>",
      "<b>Aim:</b> ", aim.dataset, "<br>",
      "<b>Variables:</b> ", variables.collected, "<br>",
      "<b>Country:</b> ", country, "<br>",
      "<b>Year Range:</b> ", year.range, "<br>",
      "<b>Population:</b> ", population, "<br>",
      "<b>Sample Size:</b> ", sample.size, "<br>",
      "<b>Sports Covered:</b> ", sports.covered, "<br>",
      "<b>Data Type:</b> ", data.type, "<br>",
      "<b>Study Design:</b> ", study.design, "<br>",
      "<b>Category:</b> ", literature.category
    ),

    parents = paste(dataset.type, use.case.type, sport.type, sep = "-")
  )


treeD <- bind_rows(
  lvl1,
  lvl2,
  lvl3,
  lvl4
)

treeD
# Insert colours
levels <- c(
  "#E69F00", # Level 1 (dataset.type)
  "#009E73", # Level 2 (use.case.type)
  "#0072B2", # Level 3 (sport.type)
  "#000000"  # Level 4 (dataset)
)


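treeD <- treeD %>%
  mutate(
    # Look the level up by membership rather than by counting hyphens,
    # so labels that contain a hyphen (e.g. "Player-focused") are not misclassified
    level = case_when(
      ids %in% lvl1$ids ~ 1,
      ids %in% lvl2$ids ~ 2,
      ids %in% lvl3$ids ~ 3,
      TRUE ~ 4
    ),
    colors = levels[level]
  )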

plot_ly(
  treeD,
  type = "treemap",
  ids = ~ids,
  labels = ~labels,
  parents = ~parents,
  marker = list(colors = ~colors),
  textinfo = "label+children"
)
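
A treemap only renders correctly when every id is unique and every non-root parent exists as an id. The following base R sketch checks both conditions on a made-up four-level hierarchy that follows the same ids/parents scheme, and derives each node's level by walking up the parent chain; `depth_of` is a hypothetical helper.

```r
# Toy hierarchy (made up) following the ids/parents scheme above
tree <- data.frame(
  ids = c("Article", "Article-Injury", "Article-Injury-Football",
          "Article-Injury-Football-DS1"),
  parents = c("", "Article", "Article-Injury", "Article-Injury-Football"),
  stringsAsFactors = FALSE
)

# Every id must be unique, and every non-root parent must itself be an id
orphans <- setdiff(tree$parents[tree$parents != ""], tree$ids)
stopifnot(!any(duplicated(tree$ids)), length(orphans) == 0)

# Hypothetical helper: depth = number of nodes on the path up to the root
depth_of <- function(id) {
  d <- 1
  while (tree$parents[match(id, tree$ids)] != "") {
    id <- tree$parents[match(id, tree$ids)]
    d <- d + 1
  }
  d
}
tree$level <- unname(vapply(tree$ids, depth_of, numeric(1)))
tree$level
```

Deriving the level from the parent chain does not depend on the separator used in the ids, so it stays correct when a label itself contains a hyphen.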